simdutf8 – High-speed UTF-8 validation
Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from simdjson. Originally ported to Rust by the developers of simd-json.rs, but now heavily improved.
Status
This library has been thoroughly tested with sample data as well as fuzzing and there are no known bugs.
Features
- basic API for the fastest validation, optimized for valid UTF-8
- compat API as a fully compatible replacement for std::str::from_utf8()
- Supports AVX 2 and SSE 4.2 implementations on x86 and x86-64
- 🆕 ARM64 (aarch64) SIMD is supported since Rust 1.61
- 🆕 WASM (wasm32) SIMD is supported
- x86-64: Up to 23 times faster than the std library on valid non-ASCII, up to four times faster on ASCII
- aarch64: Up to eleven times faster than the std library on valid non-ASCII, up to four times faster on ASCII (Apple Silicon)
- Faster than the original simdjson implementation
- Selects the fastest implementation at runtime based on CPU support (on x86)
- Falls back to the excellent std implementation if SIMD extensions are not supported
- Written in pure Rust
- No dependencies
- No-std support
Quick start
Add the dependency to your Cargo.toml file:
[dependencies]
simdutf8 = "0.1.5"
Use simdutf8::basic::from_utf8() as a drop-in replacement for std::str::from_utf8().
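use simdutf8::basic::from_utf8;
// Validates the bytes and prints the resulting &str.
println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());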
If you need detailed information on validation failures, use simdutf8::compat::from_utf8() instead.
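use simdutf8::compat::from_utf8;
// The sequence \xEF\xB8 is cut short, so validation fails at byte offset 5.
let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
assert_eq!(err.valid_up_to(), 5);
assert_eq!(err.error_len(), Some(2));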
APIs
Basic flavor
Use the basic API flavor for maximum speed. It is fastest on valid UTF-8, but only checks for errors after processing the whole byte sequence and does not provide detailed information if the data is not valid UTF-8. simdutf8::basic::Utf8Error is a zero-sized error struct.
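To illustrate what a zero-sized error implies for callers, here is a minimal sketch that mimics the shape of the basic flavor on top of std::str::from_utf8(). The names BasicLikeError and basic_like_from_utf8 are hypothetical stand-ins, not part of simdutf8, and the sketch says nothing about how the library validates internally.

```rust
/// Hypothetical stand-in for a zero-sized error type: it records only
/// that validation failed, not where or why.
#[derive(Debug, PartialEq, Eq)]
struct BasicLikeError;

/// Hypothetical stand-in for a basic-style API, built on the std validator.
/// The position information of the std error is deliberately discarded.
fn basic_like_from_utf8(input: &[u8]) -> Result<&str, BasicLikeError> {
    std::str::from_utf8(input).map_err(|_| BasicLikeError)
}

/// Size in bytes of the stand-in error type; a unit struct occupies no space.
fn error_size() -> usize {
    std::mem::size_of::<BasicLikeError>()
}
```

A caller of such an API can only branch on success or failure. If the position of the error matters, the compat flavor is the one to use.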
Compat flavor
The compat flavor is fully API-compatible with std::str::from_utf8(). In particular, simdutf8::compat::from_utf8() returns a simdutf8::compat::Utf8Error, which has valid_up_to() and error_len() methods. The first is useful for verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.
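As a sketch of the streamed-data use case, the following validator accepts input in arbitrary chunks and carries over a character that is split across a chunk boundary. It uses std::str::from_utf8() so that it runs without dependencies; because the compat flavor is described as API-compatible, swapping in simdutf8::compat::from_utf8() (and its Utf8Error type) should work the same way, but that is an assumption of this example. StreamValidator is a hypothetical name.

```rust
use std::str::{from_utf8, Utf8Error};

/// Validates a stream delivered in arbitrary chunks. Bytes of a character
/// that is split across a chunk boundary are kept until the next chunk.
struct StreamValidator {
    pending: Vec<u8>,
}

impl StreamValidator {
    fn new() -> Self {
        StreamValidator { pending: Vec::new() }
    }

    /// Feeds one chunk; returns the error if the data is definitely invalid.
    fn feed(&mut self, chunk: &[u8]) -> Result<(), Utf8Error> {
        self.pending.extend_from_slice(chunk);
        // Drop the borrowed &str right away so the buffer can be modified.
        let result = from_utf8(&self.pending).map(|_| ());
        match result {
            Ok(()) => {
                self.pending.clear();
                Ok(())
            }
            // error_len() == None: the input ended in the middle of a
            // character, so keep only the unfinished tail for the next chunk.
            Err(e) if e.error_len().is_none() => {
                self.pending.drain(..e.valid_up_to());
                Ok(())
            }
            Err(e) => Err(e),
        }
    }

    /// The stream is valid only if no partial character is left over.
    fn finish(self) -> bool {
        self.pending.is_empty()
    }
}
```

Feeding b"caf\xC3" and then b"\xA9!" succeeds, because the two halves of the two-byte character are joined before the second validation. At most three bytes are carried over between chunks.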
It also fails early: errors are checked on the fly as the string is processed and once an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data. This comes at a slight performance penalty compared to the basic API even if the input is valid UTF-8.
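The replacement-character use of error_len() mentioned above can be sketched as a lossy decoder. As before, the example is written against std::str::from_utf8() so it is self-contained; using simdutf8::compat::from_utf8() instead relies on the stated API compatibility and is an assumption here. decode_lossy is a hypothetical helper name.

```rust
use std::str::from_utf8;

/// Decodes `input`, substituting U+FFFD for every invalid byte sequence.
fn decode_lossy(mut input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    loop {
        match from_utf8(input) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = input.split_at(err.valid_up_to());
                // The prefix was just validated, so this call cannot fail.
                out.push_str(from_utf8(valid).expect("prefix is valid UTF-8"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and continue after it.
                    Some(len) => input = &rest[len..],
                    // Truncated sequence at the end of the input: done.
                    None => return out,
                }
            }
        }
    }
}
```

Each iteration restarts validation right after the invalid sequence, so the early return on the first error is what lets the loop make progress one error at a time.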
Implementation selection
X86
The fastest implementation is selected at runtime using the std::is_x86_feature_detected! macro, unless the CPU targeted by the compiler supports the fastest available implementation. So if you compile with RUSTFLAGS="-C target-cpu=native" on a recent x86-64 machine, the AVX 2 implementation is selected at compile-time and runtime selection is disabled.
For no-std support (compiled with --no-default-features) the implementation is always selected at compile time based on the targeted CPU. Use RUSTFLAGS="-C target-feature=+avx2" for the AVX 2 implementation or RUSTFLAGS="-C target-feature=+sse4.2" for the SSE 4.2 implementation.
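The selection order described above can be sketched as follows for a build with std. This is not the crate's actual dispatch code; the function names and the returned labels are illustrative only. It shows the two layers: a compile-time check on the targeted CPU, then a runtime probe with std::is_x86_feature_detected!.

```rust
/// Runtime probe on x86/x86-64, preferring the wider SIMD extension.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect_at_runtime() -> &'static str {
    if std::is_x86_feature_detected!("avx2") {
        "avx2"
    } else if std::is_x86_feature_detected!("sse4.2") {
        "sse4.2"
    } else {
        "std"
    }
}

/// Other architectures have no x86 feature probe in this sketch.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detect_at_runtime() -> &'static str {
    "std"
}

/// Names the implementation this sketch would pick.
fn select_implementation() -> &'static str {
    // If the compiler already targets a CPU with AVX 2, the choice is fixed
    // at compile time and the runtime probe is skipped.
    if cfg!(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "avx2"
    )) {
        "avx2"
    } else {
        detect_at_runtime()
    }
}
```

In a no-std build only the compile-time branch would be available, which is why the target-feature flags matter there.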
ARM64
The SIMD implementation is used automatically since Rust 1.61.
WASM32
For wasm32 support, the implementation is selected at compile time based on the presence of the simd128 target feature. Use RUSTFLAGS="-C target-feature=+simd128" to enable the WASM SIMD implementation. WASM, at the time of this writing, doesn't have a way to detect SIMD through WASM itself. Although this capability is available in various WASM host environments (e.g., wasm-feature-detect in the web browser), there is no portable way from within the library to detect this.
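Compile-time selection of this kind is typically expressed with cfg conditions. The following minimal sketch shows the idea; the function names and labels are hypothetical and do not come from simdutf8.

```rust
/// True only when compiling for wasm32 with the `simd128` target feature.
/// `cfg!` is resolved at compile time, so there is no runtime probe.
fn wasm_simd_selected() -> bool {
    cfg!(all(target_arch = "wasm32", target_feature = "simd128"))
}

/// Label for the chosen code path; the labels are illustrative only.
fn wasm_implementation() -> &'static str {
    if wasm_simd_selected() {
        "simd128"
    } else {
        "fallback"
    }
}
```

Because the decision is baked into the binary, a module built with +simd128 will not run on a host without SIMD support; shipping two builds and picking one in the host environment is a common workaround.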
Building/Targeting WASM
See this document for more details.
Access to low-level functionality
If you want to be able to call a SIMD implementation directly, use the public_imp feature flag. The validation implementations are then accessible in the simdutf8::{basic, compat}::imp hierarchy. Traits facilitating streaming validation are available there as well.
Optimisation flags
Do not use opt-level = "z", which prevents inlining and makes the code quite slow.
Minimum Supported Rust Version (MSRV)
This crate's minimum supported Rust version is 1.38.0.
Benchmarks
The benchmarks were run with criterion; the tables were created with critcmp. Source code and data are in the bench directory.
The naming schema is id-charset/size. 0-empty is the empty byte slice, x-error/66536 is a 64KiB slice where the very first character is invalid UTF-8. Library versions are simdutf8 v0.1.2 and simdjson v0.9.2. When comparing with simdjson, simdutf8 is compiled with #[inline(never)].
Configurations:
- X86-64: PC with an AMD Ryzen 7 PRO 3700 CPU (Zen2) on Linux with Rust 1.52.0
- Aarch64: MacBook Air with an Apple M1 CPU (Apple Silicon) on macOS with rustc 1.54.0-nightly (881c1ac40 2021-05-08).
simdutf8 basic vs std library on x86-64 (AMD Zen2)
Simdutf8 is up to 23 times faster than the std library on valid non-ASCII, up to four times faster on pure ASCII.
simdutf8 basic vs std library on aarch64 (Apple Silicon)
Simdutf8 is up to eleven times faster than the std library on valid non-ASCII, up to four times faster on pure ASCII.
simdutf8 basic vs simdjson on x86-64
Simdutf8 is faster than simdjson on almost all inputs.
simdutf8 basic vs simdutf8 compat UTF-8 on x86-64
There is a small performance penalty to continuously checking the error status while processing data, but detecting errors early provides a huge benefit for the x-error/66536 benchmark.
Technical details
For inputs shorter than 64 bytes validation is delegated to core::str::from_utf8() except for the direct-access functions in simdutf8::{basic, compat}::imp.
The SIMD implementation is mostly similar to the one in simdjson except that it has additional optimizations for the pure ASCII case. It also uses prefetch with AVX 2 on x86, which leads to slightly better performance with some Intel CPUs on synthetic benchmarks.
For the compat API, we need to check the error status vector on each 64-byte block instead of just aggregating it. If an error is found, the last bytes of the previous block are checked for a cross-block continuation and then std::str::from_utf8() is run to find the exact location of the error.
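The overall shape of this strategy (short-input delegation, a per-block ASCII check, and a scalar pass to pin down the error position) can be illustrated with a scalar sketch. This is not the crate's SIMD code: validate_blockwise and BLOCK are hypothetical names, the scalar validator stands in for the SIMD check, and the cross-block continuation check is sidestepped by only splitting the input after a pure-ASCII prefix.

```rust
const BLOCK: usize = 64;

/// Returns Ok(()) for valid UTF-8, or Err(offset of the first invalid byte).
fn validate_blockwise(input: &[u8]) -> Result<(), usize> {
    // Short inputs are delegated to the scalar validator outright.
    if input.len() < BLOCK {
        return core::str::from_utf8(input)
            .map(|_| ())
            .map_err(|e| e.valid_up_to());
    }

    // ASCII fast path: OR all bytes of a block together; the high bit of
    // the result is set only if some byte in the block is non-ASCII.
    let mut ascii_prefix = 0;
    for block in input.chunks_exact(BLOCK) {
        let combined = block.iter().fold(0u8, |acc, &b| acc | b);
        if combined & 0x80 != 0 {
            break;
        }
        ascii_prefix += BLOCK;
    }

    // A pure-ASCII prefix ends on a character boundary, so the remainder
    // can be validated on its own. The scalar validator also yields the
    // exact error position, which is shifted back by the prefix length.
    core::str::from_utf8(&input[ascii_prefix..])
        .map(|_| ())
        .map_err(|e| ascii_prefix + e.valid_up_to())
}
```

For a 200-byte ASCII buffer with 0xFF written at index 130, the first two blocks are skipped by the fast path and the reported offset is 130.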
Care is taken that all functions are properly inlined up to the public interface.
Thanks
- to the authors of simdjson for coming up with the high-performance SIMD implementation and in particular to Daniel Lemire for his feedback. It was very helpful.
- to the authors of the simdjson Rust port who did most of the heavy lifting of porting the C++ code to Rust.
License
This code is dual-licensed under the Apache License 2.0 and the MIT License.
It is based on code distributed with simd-json.rs, the Rust port of simdjson, which is dual-licensed under the MIT license and Apache 2.0 license as well.
simdjson itself is distributed under the Apache License 2.0.
References
John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021